1. About this Competition
This section describes the data files, and the diagram explains the relationship between them. The text below was copied directly from the Kaggle website: https://www.kaggle.com/c/home-credit-default-risk/data
application_{train|test}.csv
This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample.
bureau.csv
All client’s previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.
bureau_balance.csv
Monthly balances of previous credits in Credit Bureau. This table has one row for each month of history of every previous credit reported to Credit Bureau - i.e the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
POS_CASH_balance.csv
Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample - i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
credit_card_balance.csv
Monthly balance snapshots of previous credit cards that the applicant has with Home Credit. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample - i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
previous_application.csv
All previous applications for Home Credit loans of clients who have loans in our sample. There is one row for each previous application related to loans in our data sample.
installments_payments.csv
Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample. There is a) one row for every payment that was made plus b) one row each for missed payment. One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
HomeCredit_columns_description.csv
This file contains descriptions for the columns in the various data files.
2. EDA of train and test data
Reading the train and test data
trainLocation <- "data/application_train.csv"
testLocation <- "data/application_test.csv"
library(tidyverse)
trainData <- read_csv(trainLocation)
testData <- read_csv(testLocation)
trainData
testData
The train data set has 307,511 observations and the test data set has 48,744, so about 86% of the data is used for training and about 14% for testing. Both sets share 121 features; the train data has one extra column, the TARGET variable. Most features are numeric (integer or double) and some are categorical (character). Many features also contain NA values.
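The split percentages quoted above follow directly from the two row counts. A minimal sketch of that arithmetic, using only the counts reported in the text (the variable names are hypothetical):

```r
# Row counts as reported in the text above
nTrain <- 307511
nTest  <- 48744

# Share of all applications that falls in each file
trainShare <- nTrain / (nTrain + nTest)
testShare  <- nTest / (nTrain + nTest)

round(c(train = trainShare, test = testShare), 2)  # about 0.86 and 0.14
```

On the loaded data, `colSums(is.na(trainData))` should give the NA count per column and `table(sapply(trainData, class))` the number of columns per type, assuming `trainData` was read as shown above.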
Check whether the train data has any constant features that we can remove
library(dplyr)
# count the distinct values in each column; funs() is deprecated, so pass the function directly
trainData %>% summarise_all(n_distinct)
Every feature seems to have at least 2 distinct values, so there are no constant features
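Reading the distinct counts by eye is error-prone with more than 100 columns. The sketch below lists constant columns directly. It uses base R and a small toy data frame standing in for `trainData`; the column names and values are made up for illustration.

```r
# Toy data frame standing in for trainData (hypothetical values)
toy <- data.frame(
  id       = 1:4,
  flag     = c(1, 1, 1, 1),            # constant column
  category = c("a", "b", "a", NA)
)

# Number of distinct values per column (NA counts as its own value)
distinctCounts <- sapply(toy, function(x) length(unique(x)))

# Names of the columns that hold a single distinct value
constantCols <- names(distinctCounts)[distinctCounts == 1]
constantCols
```

An empty result on the real data would confirm the statement above.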
Process numeric features
library(purrr)
library(tidyr)
library(ggplot2)
library(dplyr)
library(tidyimpute)
# take the absolute value of every numeric feature (e.g. the DAYS_* columns are recorded as negative day offsets)
numSet1 <- trainData %>% keep(is.numeric) %>% abs
# replace missing values with the median of each feature
numSetNoNa <- numSet1 %>% impute_median()
numSetNoNa
summary(numSetNoNa)
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH
Min. :100002 Min. :0.00000 Min. : 0.0000 Min. : 25650 Min. : 45000 Min. : 1616 Min. : 40500 Min. :0.00029 Min. : 7489 Min. : 0 Min. : 0 Min. : 0
1st Qu.:189146 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 112500 1st Qu.: 270000 1st Qu.: 16524 1st Qu.: 238500 1st Qu.:0.01001 1st Qu.:12413 1st Qu.: 933 1st Qu.: 2010 1st Qu.:1720
Median :278202 Median :0.00000 Median : 0.0000 Median : 147150 Median : 513531 Median : 24903 Median : 450000 Median :0.01885 Median :15750 Median : 2219 Median : 4504 Median :3254
Mean :278181 Mean :0.08073 Mean : 0.4171 Mean : 168798 Mean : 599026 Mean : 27108 Mean : 538316 Mean :0.02087 Mean :16037 Mean : 67725 Mean : 4986 Mean :2994
3rd Qu.:367143 3rd Qu.:0.00000 3rd Qu.: 1.0000 3rd Qu.: 202500 3rd Qu.: 808650 3rd Qu.: 34596 3rd Qu.: 679500 3rd Qu.:0.02866 3rd Qu.:19682 3rd Qu.: 5707 3rd Qu.: 7480 3rd Qu.:4299
Max. :456255 Max. :1.00000 Max. :19.0000 Max. :117000000 Max. :4050000 Max. :258026 Max. :4050000 Max. :0.07251 Max. :25229 Max. :365243 Max. :24672 Max. :7197
OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START
Min. : 0.00 Min. :0 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. : 1.000 Min. :1.000 Min. :1.000 Min. : 0.00
1st Qu.: 9.00 1st Qu.:1 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:10.00
Median : 9.00 Median :1 Median :1.0000 Median :0.0000 Median :1.0000 Median :0.0000 Median :0.00000 Median : 2.000 Median :2.000 Median :2.000 Median :12.00
Mean :10.04 Mean :1 Mean :0.8199 Mean :0.1994 Mean :0.9981 Mean :0.2811 Mean :0.05672 Mean : 2.153 Mean :2.052 Mean :2.032 Mean :12.06
3rd Qu.: 9.00 3rd Qu.:1 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:14.00
Max. :91.00 Max. :1 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :20.000 Max. :3.000 Max. :3.000 Max. :23.00
REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.01457 Min. :0.0000001 Min. :0.0005273
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.50600 1st Qu.:0.3929737 1st Qu.:0.4170997
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.50600 Median :0.5659614 Median :0.5352763
Mean :0.01514 Mean :0.05077 Mean :0.04066 Mean :0.07817 Mean :0.2305 Mean :0.1796 Mean :0.50431 Mean :0.5145034 Mean :0.5156949
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.50600 3rd Qu.:0.6634218 3rd Qu.:0.6363762
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :0.96269 Max. :0.8549997 Max. :0.8960095
APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
1st Qu.:0.0876 1st Qu.:0.07630 1st Qu.:0.9816 1st Qu.:0.7552 1st Qu.:0.02110 1st Qu.:0.00000 1st Qu.:0.1379 1st Qu.:0.1667 1st Qu.:0.2083 1st Qu.:0.04810 1st Qu.:0.07560
Median :0.0876 Median :0.07630 Median :0.9816 Median :0.7552 Median :0.02110 Median :0.00000 Median :0.1379 Median :0.1667 Median :0.2083 Median :0.04810 Median :0.07560
Mean :0.1023 Mean :0.08134 Mean :0.9796 Mean :0.7543 Mean :0.02819 Mean :0.03687 Mean :0.1438 Mean :0.1966 Mean :0.2159 Mean :0.05551 Mean :0.08357
3rd Qu.:0.0876 3rd Qu.:0.07630 3rd Qu.:0.9821 3rd Qu.:0.7552 3rd Qu.:0.02110 3rd Qu.:0.00000 3rd Qu.:0.1379 3rd Qu.:0.1667 3rd Qu.:0.2083 3rd Qu.:0.04810 3rd Qu.:0.07560
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE
Min. :0.00000 Min. :0.000000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.07450 1st Qu.:0.000000 1st Qu.:0.0036 1st Qu.:0.08400 1st Qu.:0.07460 1st Qu.:0.9811 1st Qu.:0.7648 1st Qu.:0.0190 1st Qu.:0.00000 1st Qu.:0.1379 1st Qu.:0.1667
Median :0.07450 Median :0.000000 Median :0.0036 Median :0.08400 Median :0.07460 Median :0.9816 Median :0.7648 Median :0.0190 Median :0.00000 Median :0.1379 Median :0.1667
Mean :0.09089 Mean :0.002693 Mean :0.0147 Mean :0.09889 Mean :0.07997 Mean :0.9793 Mean :0.7631 Mean :0.0261 Mean :0.03479 Mean :0.1415 Mean :0.1946
3rd Qu.:0.07450 3rd Qu.:0.000000 3rd Qu.:0.0036 3rd Qu.:0.08400 3rd Qu.:0.07460 3rd Qu.:0.9816 3rd Qu.:0.7648 3rd Qu.:0.0190 3rd Qu.:0.00000 3rd Qu.:0.1379 3rd Qu.:0.1667
Max. :1.00000 Max. :1.000000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.2083 1st Qu.:0.04580 1st Qu.:0.07710 1st Qu.:0.07310 1st Qu.:0.000000 1st Qu.:0.00110 1st Qu.:0.0864 1st Qu.:0.07580 1st Qu.:0.9816 1st Qu.:0.7585 1st Qu.:0.02080
Median :0.2083 Median :0.04580 Median :0.07710 Median :0.07310 Median :0.000000 Median :0.00110 Median :0.0864 Median :0.07580 Median :0.9816 Median :0.7585 Median :0.02080
Mean :0.2147 Mean :0.05358 Mean :0.08613 Mean :0.08947 Mean :0.002469 Mean :0.01272 Mean :0.1019 Mean :0.08084 Mean :0.9796 Mean :0.7576 Mean :0.02797
3rd Qu.:0.2083 3rd Qu.:0.04580 3rd Qu.:0.07710 3rd Qu.:0.07310 3rd Qu.:0.000000 3rd Qu.:0.00110 3rd Qu.:0.0864 3rd Qu.:0.07580 3rd Qu.:0.9821 3rd Qu.:0.7585 3rd Qu.:0.02080
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. : 0.000
1st Qu.:0.00000 1st Qu.:0.1379 1st Qu.:0.1667 1st Qu.:0.2083 1st Qu.:0.0487 1st Qu.:0.07610 1st Qu.:0.07490 1st Qu.:0.000000 1st Qu.:0.00310 1st Qu.:0.06700 1st Qu.: 0.000
Median :0.00000 Median :0.1379 Median :0.1667 Median :0.2083 Median :0.0487 Median :0.07610 Median :0.07490 Median :0.000000 Median :0.00310 Median :0.06880 Median : 0.000
Mean :0.03647 Mean :0.1435 Mean :0.1964 Mean :0.2158 Mean :0.0562 Mean :0.08428 Mean :0.09169 Mean :0.002644 Mean :0.01437 Mean :0.08626 Mean : 1.417
3rd Qu.:0.00000 3rd Qu.:0.1379 3rd Qu.:0.1667 3rd Qu.:0.2083 3rd Qu.:0.0487 3rd Qu.:0.07610 3rd Qu.:0.07490 3rd Qu.:0.000000 3rd Qu.:0.00310 3rd Qu.:0.07030 3rd Qu.: 2.000
Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :348.000
DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7
Min. : 0.0000 Min. : 0.000 Min. : 0.00000 Min. : 0.0 Min. :0.00e+00 Min. :0.00 Min. :0.00e+00 Min. :0.00000 Min. :0.00000 Min. :0.0000000
1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 274.0 1st Qu.:0.00e+00 1st Qu.:0.00 1st Qu.:0.00e+00 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000000
Median : 0.0000 Median : 0.000 Median : 0.00000 Median : 757.0 Median :0.00e+00 Median :1.00 Median :0.00e+00 Median :0.00000 Median :0.00000 Median :0.0000000
Mean : 0.1429 Mean : 1.401 Mean : 0.09972 Mean : 962.9 Mean :4.23e-05 Mean :0.71 Mean :8.13e-05 Mean :0.01511 Mean :0.08806 Mean :0.0001919
3rd Qu.: 0.0000 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.:1570.0 3rd Qu.:0.00e+00 3rd Qu.:1.00 3rd Qu.:0.00e+00 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000000
Max. :34.0000 Max. :344.000 Max. :24.00000 Max. :4292.0 Max. :1.00e+00 Max. :1.00 Max. :1.00e+00 Max. :1.00000 Max. :1.00000 Max. :1.0000000
FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18
Min. :0.00000 Min. :0.000000 Min. :0.00e+00 Min. :0.000000 Min. :0.0e+00 Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.0000000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.00000
Median :0.00000 Median :0.000000 Median :0.00e+00 Median :0.000000 Median :0.0e+00 Median :0.000000 Median :0.000000 Median :0.00000 Median :0.000000 Median :0.0000000 Median :0.00000
Mean :0.08138 Mean :0.003896 Mean :2.28e-05 Mean :0.003912 Mean :6.5e-06 Mean :0.003525 Mean :0.002936 Mean :0.00121 Mean :0.009928 Mean :0.0002667 Mean :0.00813
3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.000000 Max. :1.00e+00 Max. :1.000000 Max. :1.0e+00 Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.0000000 Max. :1.00000
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
Min. :0.0000000 Min. :0.0000000 Min. :0.0000000 Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. : 0.0000 Min. : 0.0000 Min. : 0.000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 1.000
Median :0.0000000 Median :0.0000000 Median :0.0000000 Median :0.000000 Median :0.000000 Median :0.00000 Median : 0.0000 Median : 0.0000 Median : 1.000
Mean :0.0005951 Mean :0.0005073 Mean :0.0003349 Mean :0.005538 Mean :0.006055 Mean :0.02972 Mean : 0.2313 Mean : 0.2296 Mean : 1.778
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 3.000
Max. :1.0000000 Max. :1.0000000 Max. :1.0000000 Max. :4.000000 Max. :9.000000 Max. :8.00000 Max. :27.0000 Max. :261.0000 Max. :25.000
numSetNoNa contains 106 features, including the target variable, and no longer has any missing values
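The claim that no missing values remain can be checked explicitly. The sketch below is a base R equivalent of median imputation, assuming `impute_median()` replaces each NA with its column median as the comment in the code above states. The helper `imputeMedian` and the toy data are hypothetical.

```r
# Toy numeric data with missing values (hypothetical)
toyNum <- data.frame(
  a = c(1, NA, 3, 10),
  b = c(NA, 2, 2, 4)
)

# Replace the NAs of one vector with the median of its observed values
imputeMedian <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

imputed <- as.data.frame(lapply(toyNum, imputeMedian))

# Total number of missing values left after imputation
sum(is.na(imputed))
```

The same check, `sum(is.na(numSetNoNa))`, should return 0 on the real data.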
Process categorical features
# get the categorical features
catSet <- trainData %>% keep(is.character)
# replace missing values with the mode (most frequent value) of each feature
catSetNoNa <- catSet %>% impute_most_freq()
# label-encode the categorical features: convert to factor, then to integer codes
catSetFactor <- catSetNoNa %>% mutate_all(as.factor)
catSetLabelEncode <- catSetFactor %>% mutate_all(as.numeric)
catSetLabelEncode
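Label encoding through `as.factor()` followed by `as.numeric()` assigns each category the position of its level, and levels are sorted alphabetically by default. A small sketch with made-up values shows the mapping, and how reusing the same levels keeps the codes stable on new data:

```r
# Toy categorical column (hypothetical values)
contractType <- c("Cash loans", "Revolving loans", "Cash loans")

encoded <- as.factor(contractType)
levels(encoded)                      # levels in alphabetical order
codes <- as.numeric(encoded)         # position of each value's level
codes

# Reusing the original levels gives new data the same codes
newValues <- c("Revolving loans", "Cash loans")
newCodes <- as.numeric(factor(newValues, levels = levels(encoded)))
newCodes
```

The resulting integers carry no meaningful order, which matters for models that treat them as numeric magnitudes.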
Plotting
Create two functions for plotting that we use later
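# reshape to long format (key = feature name, value = feature value) and draw one panel per feature
plotDistribtuion <- function(data, plot_type) {
  data %>%
    gather() %>%
    ggplot(aes(value)) +
    facet_wrap(~key, scales = "free") +
    plot_type()
}
# same as above, with flipped axes so long category labels stay readable
plotDistribtuionCoordFlip <- function(data, plot_type) {
  data %>%
    gather() %>%
    ggplot(aes(value)) +
    facet_wrap(~key, scales = "free") +
    plot_type() + coord_flip()
}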
We split the numeric features into smaller chunks for plotting
NumCol_1 <- numSetNoNa %>% select(1:9)
NumCol_2 <- numSetNoNa %>% select(10:19)
NumCol_3 <- numSetNoNa %>% select(20:29)
NumCol_4 <- numSetNoNa %>% select(30:39)
NumCol_5 <- numSetNoNa %>% select(40:49)
NumCol_6 <- numSetNoNa %>% select(50:59)
NumCol_7 <- numSetNoNa %>% select(60:69)
NumCol_8 <- numSetNoNa %>% select(70:79)
NumCol_9 <- numSetNoNa %>% select(80:89)
NumCol_10 <- numSetNoNa %>% select(90:99)
NumCol_11 <- numSetNoNa %>% select(100:106)
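The manual ranges above can also be generated programmatically. The following sketch splits the columns of a data frame into groups of a fixed size using base R. The chunk size of 10 and the toy data frame are assumptions for illustration; unlike the ranges above, every chunk except the last gets exactly 10 columns.

```r
# Toy data frame with 25 columns standing in for numSetNoNa (hypothetical)
toy <- as.data.frame(matrix(0, nrow = 2, ncol = 25))

chunkSize <- 10

# Assign each column index to a group: 1..10 -> 1, 11..20 -> 2, ...
groupId <- ceiling(seq_along(toy) / chunkSize)

# One data frame per group of columns
chunks <- lapply(split(seq_along(toy), groupId),
                 function(idx) toy[, idx, drop = FALSE])

sapply(chunks, ncol)
```

Each element of `chunks` could then be passed to the plotting function in a loop instead of calling it once per named chunk.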
Plot histogram of numeric features
plotDistribtuion(NumCol_1,geom_histogram)
plotDistribtuion(NumCol_2,geom_histogram)
plotDistribtuion(NumCol_3,geom_histogram)
plotDistribtuion(NumCol_4,geom_histogram)
plotDistribtuion(NumCol_5,geom_histogram)
plotDistribtuion(NumCol_6,geom_histogram)
plotDistribtuion(NumCol_7,geom_histogram)
plotDistribtuion(NumCol_8,geom_histogram)
plotDistribtuion(NumCol_9,geom_histogram)
plotDistribtuion(NumCol_10,geom_histogram)
plotDistribtuion(NumCol_11,geom_histogram)
Many of the features are right-skewed or binary
We split the categorical features into smaller chunks for plotting
mutiTypeCol <- c("NAME_CONTRACT_TYPE","NAME_EDUCATION_TYPE","NAME_FAMILY_STATUS","NAME_HOUSING_TYPE","NAME_INCOME_TYPE","NAME_TYPE_SUITE","OCCUPATION_TYPE","WEEKDAY_APPR_PROCESS_START","FONDKAPREMONT_MODE","WALLSMATERIAL_MODE","HOUSETYPE_MODE")
catCol1 <- catSetNoNa %>% select(all_of(mutiTypeCol))
catCol2 <- catSetNoNa %>% select(-all_of(mutiTypeCol),-ORGANIZATION_TYPE)
catCol3 <- catSetNoNa %>% select(ORGANIZATION_TYPE)
Plot bar charts of categorical features
plotDistribtuionCoordFlip(catCol1,geom_bar)
plotDistribtuion(catCol2,geom_bar)
plotDistribtuionCoordFlip(catCol3,geom_bar)
We pick some features that we think are important and plot their correlations
corFeatures <- numSetNoNa %>% select(TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,CNT_FAM_MEMBERS,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE)
library(corrplot)
M <- round(cor(corFeatures),2)
corrplot(M, method = "number")
If two features are highly correlated with each other, we can remove one of them later when building the model
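Highly correlated pairs can be listed from the correlation matrix rather than read off the plot. The sketch below uses base R on toy data. The 0.9 threshold is an arbitrary assumption, not a value taken from this analysis, and the column names are made up.

```r
# Toy data: y is strongly related to x, z is not (hypothetical)
x <- 1:10
toy <- data.frame(x = x, y = x^2, z = rep(c(1, -1), 5))

corMat <- cor(toy)

# Keep only the lower triangle so each pair appears once
corMat[upper.tri(corMat, diag = TRUE)] <- NA

threshold <- 0.9  # assumed cut-off for "highly correlated"
idx <- which(abs(corMat) > threshold, arr.ind = TRUE)

highPairs <- data.frame(
  feature1    = rownames(corMat)[idx[, "row"]],
  feature2    = colnames(corMat)[idx[, "col"]],
  correlation = corMat[idx]
)
highPairs
```

Applied to the matrix `M` computed above, the same steps would list the candidate pairs from which one feature could be dropped.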
numOnly[266367, 11]
[1] 10116.04
Prepare the test data the same way as the train data
testNumSet <- testData %>% keep(is.numeric) %>% abs
testNumSetNoNa <- testNumSet %>% impute_median()
testCatSet <- testData %>% keep(is.character)
testCatSetNoNa <- testCatSet %>% impute_most_freq()
testCatSetFactor <- testCatSetNoNa %>% mutate_all(as.factor)
testCatSetLabelEncode <- testCatSetFactor %>% mutate_all(as.numeric)
readyTestData <- cbind(testNumSetNoNa,testCatSetLabelEncode)
write.csv(readyTestData, file = "data/testReady.csv",row.names=FALSE)
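The steps above impute the test data with medians computed on the test data itself. A common alternative, shown here only as a sketch and not as what this notebook does, is to reuse the statistics learned on the train data so that both sets are transformed identically. The data and names below are hypothetical.

```r
# Toy train and test sets (hypothetical values)
train <- data.frame(income = c(100, 200, NA, 400))
test  <- data.frame(income = c(NA, 150))

# Medians learned on the train data only
trainMedians <- sapply(train, median, na.rm = TRUE)

# Fill missing test values with the corresponding train median
testImputed <- test
for (col in names(testImputed)) {
  missing <- is.na(testImputed[[col]])
  testImputed[[col]][missing] <- trainMedians[[col]]
}

testImputed
```

The same idea applies to label encoding: building the test factors with the levels seen in the train data keeps the integer codes aligned between the two files.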